This is a continuation of the preliminary exploratory data analysis (EDA) of the data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import plotly.express as px
from plotly.subplots import make_subplots
from statsmodels.graphics.tsaplots import month_plot
from statsmodels.api import tsa
import plotly.graph_objs as go
import calendar
# Note to self to trim the packages for this EDA, not all of these will be used at this time.
df = pd.read_csv('data/weatherstats_vancouver_hourly_clean.csv')
df.head()
| | date_time_local | pressure_station | pressure_sea | wind_dir | wind_speed | wind_gust | relative_humidity | dew_point | temperature | windchill | humidex | visibility | health_index | cloud_okta | max_air_temp_pst1hr | min_air_temp_pst1hr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-07-01 00:00:00 | 101.18 | 101.16 | SSE | 7 | 0.0 | 91 | 18.2 | 19.7 | 0.0 | 0.0 | 32200.0 | 2.9 | 5.0 | 19.4 | 18.5 |
| 1 | 2013-07-01 01:00:00 | 101.22 | 101.21 | SE | 6 | 0.0 | 89 | 17.8 | 19.6 | 0.0 | 0.0 | 32200.0 | 3.0 | 5.0 | 20.1 | 18.7 |
| 2 | 2013-07-01 02:00:00 | 101.26 | 101.24 | E | 11 | 0.0 | 88 | 16.7 | 18.7 | 0.0 | 0.0 | 32200.0 | 3.0 | 5.0 | 19.8 | 18.0 |
| 3 | 2013-07-01 03:00:00 | 101.26 | 101.25 | E | 4 | 0.0 | 84 | 16.5 | 19.2 | 0.0 | 0.0 | 32200.0 | 2.7 | 5.0 | 18.5 | 17.5 |
| 4 | 2013-07-01 04:00:00 | 101.30 | 101.28 | NNW | 5 | 0.0 | 87 | 15.7 | 17.9 | 0.0 | 0.0 | 32200.0 | 2.6 | 5.0 | 18.8 | 17.3 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87648 entries, 0 to 87647
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   date_time_local      87648 non-null  object
 1   pressure_station     87648 non-null  float64
 2   pressure_sea         87648 non-null  float64
 3   wind_dir             87648 non-null  object
 4   wind_speed           87648 non-null  int64
 5   wind_gust            87648 non-null  float64
 6   relative_humidity    87648 non-null  int64
 7   dew_point            87648 non-null  float64
 8   temperature          87648 non-null  float64
 9   windchill            87648 non-null  float64
 10  humidex              87648 non-null  float64
 11  visibility           87648 non-null  float64
 12  health_index         87648 non-null  float64
 13  cloud_okta           87648 non-null  float64
 14  max_air_temp_pst1hr  87648 non-null  float64
 15  min_air_temp_pst1hr  87648 non-null  float64
dtypes: float64(12), int64(2), object(2)
memory usage: 10.7+ MB
df['date_time_local'] = pd.to_datetime(df['date_time_local'], utc=False)
df = df.set_index('date_time_local')
df.head()
| date_time_local | pressure_station | pressure_sea | wind_dir | wind_speed | wind_gust | relative_humidity | dew_point | temperature | windchill | humidex | visibility | health_index | cloud_okta | max_air_temp_pst1hr | min_air_temp_pst1hr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-07-01 00:00:00 | 101.18 | 101.16 | SSE | 7 | 0.0 | 91 | 18.2 | 19.7 | 0.0 | 0.0 | 32200.0 | 2.9 | 5.0 | 19.4 | 18.5 |
| 2013-07-01 01:00:00 | 101.22 | 101.21 | SE | 6 | 0.0 | 89 | 17.8 | 19.6 | 0.0 | 0.0 | 32200.0 | 3.0 | 5.0 | 20.1 | 18.7 |
| 2013-07-01 02:00:00 | 101.26 | 101.24 | E | 11 | 0.0 | 88 | 16.7 | 18.7 | 0.0 | 0.0 | 32200.0 | 3.0 | 5.0 | 19.8 | 18.0 |
| 2013-07-01 03:00:00 | 101.26 | 101.25 | E | 4 | 0.0 | 84 | 16.5 | 19.2 | 0.0 | 0.0 | 32200.0 | 2.7 | 5.0 | 18.5 | 17.5 |
| 2013-07-01 04:00:00 | 101.30 | 101.28 | NNW | 5 | 0.0 | 87 | 15.7 | 17.9 | 0.0 | 0.0 | 32200.0 | 2.6 | 5.0 | 18.8 | 17.3 |
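With the timestamp as the index, it can be worth confirming that the hourly series has no gaps or duplicates before resampling. The 87,648 rows reported by `df.info()` equal 3,652 days × 24 hours, which is consistent with a complete hourly record from July 2013 through June 2023. The sketch below shows one way to check this on a small synthetic frame; the name `toy` and its values are illustrative assumptions, not the Vancouver data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the hourly data: three days of readings.
one_hour = pd.Timedelta(hours=1)
full_index = pd.date_range("2013-07-01", periods=72, freq=one_hour)
toy = pd.DataFrame({"temperature": np.linspace(15.0, 20.0, 72)}, index=full_index)

# Simulate a single missing observation.
toy = toy.drop(full_index[30])

# Build the index we would expect and compare it with the one we have.
expected = pd.date_range(toy.index.min(), toy.index.max(), freq=one_hour)
missing = expected.difference(toy.index)
n_duplicates = int(toy.index.duplicated().sum())

print("duplicates:", n_duplicates)
print("missing timestamps:", list(missing))
```

The same two checks could be run on `df.index` directly once `date_time_local` has been set as the index.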
df2 = df[['wind_speed', 'wind_gust', 'temperature', 'windchill', 'humidex', 'dew_point']]
# To further analyze the temperature I will resample the data to monthly averages.
df2_monthly = df2.resample("MS").mean()
# Use the following code to check that the averages were applied.
df2_monthly.head()
| date_time_local | wind_speed | wind_gust | temperature | windchill | humidex | dew_point |
|---|---|---|---|---|---|---|
| 2013-07-01 | 13.928763 | 0.000000 | 18.457527 | 0.000000 | 8.788978 | 13.483737 |
| 2013-08-01 | 12.366935 | 0.342742 | 18.443145 | 0.000000 | 8.071237 | 14.270833 |
| 2013-09-01 | 12.741667 | 2.688889 | 15.384444 | 0.000000 | 2.008333 | 13.043194 |
| 2013-10-01 | 9.686828 | 0.580645 | 9.302554 | 0.000000 | 0.029570 | 7.311559 |
| 2013-11-01 | 11.680999 | 3.586685 | 6.237032 | -0.223301 | 0.000000 | 4.013454 |
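Beyond inspecting `head()`, one way to confirm what `resample("MS").mean()` does is to recompute a single month by hand and compare. The sketch below does this on synthetic hourly values; the names `hourly` and `july_manual` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperatures covering two full months (not the Vancouver data).
idx = pd.date_range("2013-07-01", "2013-08-31 23:00", freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
hourly = pd.Series(15 + rng.normal(0, 3, len(idx)), index=idx, name="temperature")

# Month-start resampling, as in the notebook: one row per month,
# labelled with the first day of that month.
monthly = hourly.resample("MS").mean()

# Manual check: average every July reading directly.
july_manual = hourly[hourly.index.month == 7].mean()

print(monthly)
print("July by hand:", july_manual)
```

The first row of `monthly` should match the hand-computed July mean, and its label is the month's first day, which is why the table above shows dates such as 2013-07-01.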
# This graph will show us the monthly averages over the years.
fig = px.line(df2_monthly, x=df2_monthly.index, y='temperature',)
fig.update_layout(
yaxis_title="Degrees",
xaxis_title="Year",
legend_title="",
title="Monthly Temperature Average from July 2013 - June 2023"
)
fig.show()
July and August are the warmest months in most years, and August 2022 has the highest monthly average of the whole period, exceeding every other month by at least 0.5 °C.
# Now we want to look at the Trend, Seasonal, and Residuals lines for further insights.
decomposition = sm.tsa.seasonal_decompose(df2_monthly["temperature"], model='additive')
# We need to create new columns for these values in our monthly average data set.
df2_monthly["Monthly Trend"] = decomposition.trend
df2_monthly["Monthly Seasonal"] = decomposition.seasonal
df2_monthly["Monthly Residual"] = decomposition.resid
# This graph will plot only these columns for the temperature.
cols = ["Monthly Trend", "Monthly Seasonal", "Monthly Residual"]
fig = make_subplots(rows=3, cols=1, subplot_titles=cols)
for i, col in enumerate(cols):
    fig.add_trace(
        go.Scatter(x=df2_monthly.index, y=df2_monthly[col]),
        row=i + 1,
        col=1,
    )
fig.update_layout(height=800, width=1200, showlegend=False)
fig.show()
The trend line in the monthly averages still shows higher temperatures overall around July and August, as expected for the hottest months. Interestingly, the trend does not show a sharp increase in 2021; it shows that sharp increase in 2015 instead. Overall, the trend suggests that while the temperature has increased, it has also "leveled out". The trend line omits the first six and the last six months because it is computed as a centered 12-month moving average. <br> The Seasonal line shows what we already expected: high temperatures in the summer and lower temperatures in the winter. <br> The Residual line shows no clear pattern, which we can take as confirmation that the structure and variability in the data have been captured by the Trend and Seasonal lines.
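To see why the trend has no values for the first and last six months, the sketch below builds a centered 12-month moving average by hand. It illustrates the general technique on a synthetic series, on the assumption that `seasonal_decompose` with its default two-sided filter behaves similarly; it is not statsmodels' own code, and all names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: a linear trend plus a 12-month cycle.
idx = pd.date_range("2013-07-01", periods=48, freq="MS")
t = np.arange(48)
series = pd.Series(10 + 0.05 * t + 6 * np.sin(2 * np.pi * t / 12), index=idx)

# Centered average for an even period: 13 weights, with half weight
# on the two end points so the window stays symmetric around month t.
weights = np.r_[0.5, np.ones(11), 0.5] / 12
smoothed = np.convolve(series.to_numpy(), weights, mode="valid")

# The window needs six months on either side, so pad the edges with NaN.
trend = pd.Series(np.nan, index=idx)
trend.iloc[6:-6] = smoothed

print("months without a trend estimate:", int(trend.isna().sum()))
print(trend.dropna().head())
```

Because the window spans exactly one full cycle, the seasonal term averages out and only the underlying linear trend remains in the interior of the series.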
# To look further into the patterns and comparisons by month per each year, we will plot the Seasonal Difference.
df2_monthly["Monthly Seasonal_Difference"] = df2_monthly["temperature"].diff(12)
# The graph will not show the first twelve months: diff(12) subtracts the value from twelve months earlier, which does not exist for the first year.
fig = px.line(df2_monthly, x=df2_monthly.index, y="Monthly Seasonal_Difference")
fig.update_layout(
yaxis_title="Difference (temperature)",
xaxis_title="Date",
title="Change in Monthly Temperature Comparison"
)
fig.show()
Concentrating on June across the years, we can see that June 2021 was on average 2.43 degrees warmer than June of the previous year, which is the comparison `diff(12)` makes. No other June shows such a drastic increase. The highest monthly average typically falls in August in any given year. The record high temperature in 2021 appears to be an outlier, but it would be interesting to see what contributed to June 29, 2021 being particularly hot at 15:00. The increase in the trend line in 2015 could be explained by how much warmer the months of 2015 were than the corresponding months a year earlier. For example, February 2015 was approximately 5 degrees warmer than February 2014.
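As a small illustration of what the seasonal difference computes, the sketch below applies `diff(12)` to two years of made-up monthly means and checks one value by hand. The numbers are illustrative assumptions, not the Vancouver data.

```python
import pandas as pd

# Two years of made-up monthly means.
idx = pd.date_range("2020-01-01", periods=24, freq="MS")
values = [4, 5, 7, 10, 13, 16, 18, 18, 15, 10, 6, 4,   # 2020
          3, 6, 7, 10, 14, 19, 19, 19, 15, 10, 6, 2]   # 2021
temps = pd.Series(values, index=idx, dtype="float64")

# Each value minus the value twelve rows (months) earlier.
seasonal_diff = temps.diff(12)

june_2020 = pd.Timestamp("2020-06-01")
june_2021 = pd.Timestamp("2021-06-01")

# The same number computed two ways: June 2021 minus June 2020.
print(seasonal_diff.loc[june_2021])
print(temps.loc[june_2021] - temps.loc[june_2020])

# The first year has nothing to be compared against.
print(int(seasonal_diff.isna().sum()))
```

Each point in the plot is therefore a year-over-year change for the same calendar month, not a departure from a long-run average.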
# Looking at the data on a weekly average may reveal more insights.
df2_weekly = df2.resample("W").mean()
fig = px.line(df2_weekly, x=df2_weekly.index, y='temperature',)
fig.update_layout(
yaxis_title="Degrees",
xaxis_title="Year",
legend_title="",
title="Weekly Temperature Average from July 2013 - June 2023"
)
fig.show()
The weekly averages reach noticeably higher values than the monthly averages. That being said, the pattern still shows the summer and winter months as it should. However, the summers of 2021 and 2022 seem to have had higher weekly average temperatures, by more than half a degree.
# As we did with the monthly averages, we will review the Trend, Seasonal, and Residual lines for weekly averages.
decomposition = sm.tsa.seasonal_decompose(df2_weekly["temperature"], model='additive')
df2_weekly["Weekly Trend"] = decomposition.trend
df2_weekly["Weekly Seasonal"] = decomposition.seasonal
df2_weekly["Weekly Residual"] = decomposition.resid
cols = ["Weekly Trend", "Weekly Seasonal", "Weekly Residual"]
fig = make_subplots(rows=3, cols=1, subplot_titles=cols)
for i, col in enumerate(cols):
    fig.add_trace(
        go.Scatter(x=df2_weekly.index, y=df2_weekly[col]),
        row=i + 1,
        col=1,
    )
fig.update_layout(height=800, width=1200, showlegend=False)
fig.show()
The trend line here shows a pattern similar to the monthly one: temperatures were higher overall in 2015 and leveled off in later years. Temperatures still increased, but not as drastically as in 2015. <br> The seasonal and residual lines do not give us any additional insights in this case.
# Now we move on to see the Seasonal Difference on the weekly average.
df2_weekly["Weekly Seasonal_Difference"] = df2_weekly["temperature"].diff(52)
fig = px.line(df2_weekly, x=df2_weekly.index, y="Weekly Seasonal_Difference")
fig.update_layout(
yaxis_title="Difference (temperature)",
xaxis_title="Date",
title="Change in Weekly Temperature Comparison"
)
fig.show()
To further support the idea that 2015 had higher temperatures outside of summer, the week of February 8th is 11.49 degrees warmer than the same week a year earlier. In July and August of 2021 and 2022, the weekly differences range from almost nothing to approximately 6.5 degrees. <br><br> In the weekly averages, summer temperatures do not differ as much from those of the corresponding weeks a year earlier. For example, the week that includes June 29th is only 4 degrees warmer than the same week a year earlier. The largest differences overall appear in the weeks of the winter months. This could indicate that winters have been getting less cold on average, which is interesting to note because the trend we have heard about in the news is that winters have been getting colder.
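A lagged difference compares each period with the same period one year earlier. An alternative baseline for a "typical" period is a climatology: the mean of that calendar period over all available years. The sketch below computes such anomalies for synthetic monthly data; it is an assumption about how one might define "typical", not a step from the analysis above, and all names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic monthly means over four years.
idx = pd.date_range("2013-07-01", periods=48, freq="MS")
rng = np.random.default_rng(1)
month = idx.month.to_numpy()
temps = pd.Series(
    10 + 7 * np.sin(2 * np.pi * (month - 4) / 12) + rng.normal(0, 1, 48),
    index=idx,
)

# "Typical" value for each calendar month: the mean over all years,
# broadcast back onto every row of that month.
climatology = temps.groupby(temps.index.month).transform("mean")

# Anomaly: how far each month sits from its own calendar-month average.
anomaly = temps - climatology

# Averaged by calendar month the anomalies cancel out by construction.
print(anomaly.groupby(anomaly.index.month).mean().abs().max())
# The month that departs most from its typical value.
print(anomaly.abs().idxmax())
```

For weekly data the grouping key could be the ISO week number instead of the month. Unlike `diff(52)`, this baseline is not affected by an unusually warm or cold previous year.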
# As a last step of this part of the analysis, we will now look at daily averages.
df2_daily = df2.resample("D").mean()
fig = px.line(df2_daily, x=df2_daily.index, y='temperature',)
fig.update_layout(
yaxis_title="Degrees",
xaxis_title="Year",
legend_title="",
title="Daily Temperature Average from July 2013 - June 2023"
)
fig.show()
Despite the record high hourly temperature on June 29, 2021, the highest average daily temperature in the past ten years, 26.42, was actually on June 28, 2021. The lowest average daily temperature, -10.85, does line up with that hourly low we mentioned previously on December 27, 2021.
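The hottest single hour and the hottest day on average need not fall on the same date, because a daily mean rewards sustained heat rather than a brief spike. The sketch below reproduces that effect with two made-up days of hourly readings; the values are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Two made-up days of hourly readings.
idx = pd.date_range("2021-06-28", periods=48, freq=pd.Timedelta(hours=1))
day1 = np.full(24, 27.0)   # uniformly hot day
day2 = np.full(24, 22.0)   # cooler day...
day2[15] = 32.0            # ...with one sharp spike at 15:00
hourly = pd.Series(np.concatenate([day1, day2]), index=idx)

daily_mean = hourly.resample("D").mean()

hottest_hour = hourly.idxmax()
hottest_day = daily_mean.idxmax()

print("hottest hour:", hottest_hour)
print("hottest day on average:", hottest_day)
```

On the real data, the equivalent calls would be `df2["temperature"].idxmax()` and `df2_daily["temperature"].idxmax()`.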
decomposition = sm.tsa.seasonal_decompose(df2_daily["temperature"], model='additive')
df2_daily["Daily Trend"] = decomposition.trend
df2_daily["Daily Seasonal"] = decomposition.seasonal
df2_daily["Daily Residual"] = decomposition.resid
cols = ["Daily Trend", "Daily Seasonal", "Daily Residual"]
fig = make_subplots(rows=3, cols=1, subplot_titles=cols)
for i, col in enumerate(cols):
    fig.add_trace(
        go.Scatter(x=df2_daily.index, y=df2_daily[col]),
        row=i + 1,
        col=1,
    )
fig.update_layout(height=800, width=1200, showlegend=False)
fig.show()
The trend line for the daily average is significantly different from those of the weekly and monthly averages. It shows a slightly higher temperature trend in the summer days of 2021 and 2022 as well as a lower temperature trend on the winter days. This could indicate that when the data is grouped in weeks or months, the short-term fluctuations in temperature are flattened out. It may also reflect the period used: for a daily index, `seasonal_decompose` infers a seven-day period unless `period` is given, so the trend is smoothed over a much shorter window and keeps the annual cycle. <br> It is hard to interpret the seasonal line in this graph. Because each point is already a daily average, it cannot show the rise in temperature during the day and the drop in the evening; with a seven-day period it more likely describes a day-of-week cycle. <br> The Residual does not show a particular pattern here either, which lines up with the previous observations on the monthly average analysis.
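The flattening effect of grouping can be checked directly by comparing the spread of the same series at different resolutions. The sketch below does so on a synthetic daily series with an annual cycle and day-to-day noise; the data and names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: annual cycle plus day-to-day noise.
idx = pd.date_range("2013-07-01", periods=730, freq="D")
rng = np.random.default_rng(2)
day = np.arange(730)
daily = pd.Series(
    10 + 8 * np.sin(2 * np.pi * day / 365) + rng.normal(0, 3, 730),
    index=idx,
)

weekly = daily.resample("W").mean()
monthly = daily.resample("MS").mean()

# An average can never exceed the largest value it is built from,
# so the weekly and monthly extremes lie inside the daily ones.
ranges = {
    name: float(s.max() - s.min())
    for name, s in [("daily", daily), ("weekly", weekly), ("monthly", monthly)]
}
for name, value in ranges.items():
    print(name, "range:", round(value, 2))
```

This is why isolated hot or cold days stand out in the daily plot but are muted in the weekly and monthly views.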
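# Compare each day with the day 365 days earlier (a one-year lag; leap days shift the alignment by one day).
df2_daily["Daily Seasonal_Difference"] = df2_daily["temperature"].diff(365)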
fig = px.line(df2_daily, x=df2_daily.index, y="Daily Seasonal_Difference")
fig.update_layout(
yaxis_title="Difference (temperature)",
xaxis_title="Date",
title="Change in Daily Temperature Comparison"
)
fig.show()
The daily differences indicate once more that the summer months, here seen through the daily averages, show similar differences across the years when compared with their respective typical days. <br><br> Given the observations of this analysis, it would be interesting to ascertain how the different variables affect temperature and what could be indicators of a particularly high-temperature day.